
Metadata Retry: Adds Cross-Region Operation-level Retry for Metadata Request Failures#5780

Open
yash2710 wants to merge 2 commits into main from users/trivediyash/addressRefreshCrossRegion

Conversation

@yash2710
Contributor

Description

When a Cosmos DB region becomes unavailable, metadata requests (e.g., partition key range reads) fail with 503/20003 and are retried only within the same region until exhausted, with no cross-region failover. This change enables cross-region retry at both the metadata request level and the operation level.

Problem

MetadataRequestThrottleRetryPolicy retried metadata requests in the same region using a location index, but never marked the failing endpoint as unavailable in the LocationCache. If all retries within the policy were exhausted, the exception propagated without any signal to route future requests to a different region. Additionally, the CosmosException thrown from failed metadata requests was not caught in AbstractRetryHandler, preventing ClientRetryPolicy from evaluating it for cross-region failover. The Change Feed Processor's BootstrapperCore bypasses the handler pipeline entirely and previously had no retry logic for regional failures.

Changes

MetadataRequestThrottleRetryPolicy - On regional failures (503, 500, 410/LeaseNotFound, 403/DatabaseAccountNotFound, HttpRequestException, non-user OperationCanceledException):

  1. Marks the failing endpoint as unavailable via GlobalEndpointManager.MarkEndpointUnavailableForRead() so the LocationCache deprioritizes that region for future requests
  2. Retries on the next preferred region (existing IncrementRetryIndex logic, limited to PreferredLocationCount)
  3. After exhausting all regions, returns NoRetry so the exception propagates to the operation-level retry policy
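
In sketch form, the consolidated flow looks roughly like the following. This is illustrative, not the exact diff: the member names (HandleRegionalFailure, locationEndpoint, IncrementRetryIndexOnUnavailableEndpointForMetadataRead) come from the review comments below, while the field names, guard, and retry-result factory are assumptions.

// Sketch only: condenses the three steps above. Names not shown in this PR's excerpts
// (globalEndpointManager, the zero back-off) are assumed for illustration.
private ShouldRetryResult HandleRegionalFailure()
{
    // 1. Deprioritize the failing region for future reads (self-heals after the LocationCache TTL).
    if (this.locationEndpoint != null)
    {
        this.globalEndpointManager.MarkEndpointUnavailableForRead(this.locationEndpoint);
    }

    // 2. Retry on the next preferred region while any remain (bounded by PreferredLocationCount).
    if (this.IncrementRetryIndexOnUnavailableEndpointForMetadataRead())
    {
        return ShouldRetryResult.RetryAfter(TimeSpan.Zero);
    }

    // 3. All regions exhausted: surface the exception to the operation-level retry policy.
    return ShouldRetryResult.NoRetry();
}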

AbstractRetryHandler - Added catch (CosmosException) in ExecuteHttpRequestAsync so that metadata failures (which throw CosmosException from CosmosHttpClientCore) reach ClientRetryPolicy.ShouldRetryAsync(Exception) for operation-level cross-region retry.
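
Roughly, the new catch sits inside the existing send/retry loop in ExecuteHttpRequestAsync; the variable names below (callbackMethod, retryPolicy) are illustrative, not copied from the handler.

// Sketch only: shape of the added catch, inside the handler's existing retry loop.
try
{
    return await callbackMethod();
}
catch (CosmosException cosmosException)
{
    // Metadata failures from CosmosHttpClientCore surface as CosmosException; hand them to the
    // retry policy so ClientRetryPolicy can evaluate operation-level cross-region failover.
    ShouldRetryResult result = await retryPolicy.ShouldRetryAsync(cosmosException, cancellationToken);
    if (!result.ShouldRetry)
    {
        throw;
    }
}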

BootstrapperCore - Added retry logic for CosmosException and HttpRequestException in InitializeAsync() (up to MaxInitializationRetries = 3), since the Change Feed Processor bypasses the handler pipeline and has no ClientRetryPolicy coverage.
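
The retry shape in InitializeAsync() is roughly the following, condensed from the excerpts reviewed below; the loop structure and field names (retryCount, sleepTime) are assumptions, and the actual method also performs the lease-store acquire/initialize steps omitted here.

// Sketch only: bounded retry around the initialization steps; once the when guards
// stop matching, the exception propagates to the caller.
int retryCount = 0;
while (true)
{
    try
    {
        await this.synchronizer.CreateMissingLeasesAsync().ConfigureAwait(false);
        await this.leaseStore.MarkInitializedAsync().ConfigureAwait(false);
        break;
    }
    catch (CosmosException) when (retryCount < MaxInitializationRetries)
    {
        retryCount++;
        await Task.Delay(this.sleepTime).ConfigureAwait(false);
    }
    catch (HttpRequestException) when (retryCount < MaxInitializationRetries)
    {
        retryCount++;
        await Task.Delay(this.sleepTime).ConfigureAwait(false);
    }
}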

Testing

  • 10 unit tests for MetadataRequestThrottleRetryPolicy covering: first-attempt retry, retry exhaustion, regional failure status codes, HttpRequestException, non-user OperationCanceledException, user cancellation, and 429 throttling
  • 3 unit tests for BootstrapperCore covering: CosmosException retry + success, HttpRequestException retry + success, and retry exhaustion
  • All 12 existing RetryHandlerTests continue to pass

Type of change

Please delete options that are not relevant.

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update

Closing issues

closes #5660

@azure-pipelines

Azure Pipelines:
Successfully started running 1 pipeline(s).

@github-actions bot left a comment

All good!

@yash2710 marked this pull request as ready for review on April 30, 2026 19:06
@azure-pipelines

Azure Pipelines:
Successfully started running 1 pipeline(s).

1 similar comment

@yash2710 changed the title from "Cross-Region Retry for Metadata Request Failures" to "Metadata Retry: Adds Cross-Region Retry for Metadata Request Failures" on Apr 30, 2026
@yash2710 changed the title from "Metadata Retry: Adds Cross-Region Retry for Metadata Request Failures" to "Metadata Retry: Adds Cross-Region Operation-level Retry for Metadata Request Failures" on Apr 30, 2026
@NaluTripician
Contributor

@sdkReviewAgent-2

await this.synchronizer.CreateMissingLeasesAsync().ConfigureAwait(false);
await this.leaseStore.MarkInitializedAsync().ConfigureAwait(false);
}
catch (CosmosException ex) when (retryCount < MaxInitializationRetries)
Member

🟡 Recommendation — Overbroad Exception Filter: Catches all CosmosException types including non-transient errors

The when guard only checks retryCount, but the XML doc and comment explicitly say this is for "regional error (e.g., CosmosException with 503 or HttpRequestException)". The code doesn't enforce this — a 400 BadRequest, 404 NotFound, or 409 Conflict will also be caught and retried 3 times with 15-second delays (up to 45 seconds wasted on a deterministic failure).

This is inconsistent with MetadataRequestThrottleRetryPolicy.ShouldRetryInternalAsync() in this same PR, which carefully discriminates by status code (503, 500, 410/LeaseNotFound, 403/DatabaseAccountNotFound).

Concrete failure scenario: If the lease container is deleted mid-initialization, CreateMissingLeasesAsync() throws a 404 CosmosException. This catch retries 3× against the same endpoint with the same result, adding ~45 seconds of latency before finally propagating.

Suggestion: Add a status code filter to align with the policy:

catch (CosmosException ex) when (retryCount < MaxInitializationRetries
    && (ex.StatusCode == HttpStatusCode.ServiceUnavailable
        || ex.StatusCode == HttpStatusCode.InternalServerError
        || (ex.StatusCode == HttpStatusCode.Gone
            && ex.SubStatusCode == (int)SubStatusCodes.LeaseNotFound)
        || (ex.StatusCode == HttpStatusCode.Forbidden
            && ex.SubStatusCode == (int)SubStatusCodes.DatabaseAccountNotFound)))

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

throw;
}
}
catch (CosmosException cosmosException)
Member

🟡 Recommendation — Missing test coverage for the most critical change in this PR

This new catch (CosmosException) block is the key change enabling the entire operation-level cross-region retry for metadata failures. Without it, CosmosException escapes to SendAsync's outer catch at line 38 and is silently converted to a ResponseMessage without consulting ClientRetryPolicy.

However, there is no test in RetryHandlerTests.cs (or elsewhere) that verifies this path works. The existing RetryHandlerDoesNotRetryOnException test throws a generic Exception, and RetryHandlerHttpClientExceptionRefreshesLocations covers HttpRequestException — but neither exercises the CosmosException catch.

Suggestion: Add a test in RetryHandlerTests.cs following the RetryHandlerHttpClientExceptionRefreshesLocations pattern:

  1. Have the inner handler throw a CosmosException (503)
  2. Configure the mock retry policy to return RetryAfter on the first ShouldRetryAsync(Exception) call
  3. Verify the retry policy is consulted and the handler retries the operation
  4. Add a complementary test verifying that when the policy returns NoRetry, the CosmosException re-throws (not silently converted)

This is especially important because NamedCacheRetryHandler also extends AbstractRetryHandler — its InvalidPartitionExceptionRetryPolicy will now see CosmosException too, and should be verified to handle it safely (it returns NoRetry, causing re-throw, which is correct).

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

mock.Setup(gem => gem.ResolveServiceEndpoint(It.IsAny<DocumentServiceRequest>()))
.Returns(primaryEndpoint);
// Default PreferredLocationCount = 0 → maxRetries = Max(1,0) = 1
mock.Setup(gem => gem.PreferredLocationCount).Returns(0);
Member

🟡 Recommendation — All tests use PreferredLocationCount = 0, missing multi-region cycling coverage

Every test in this file uses CreateMockEndpointManager with PreferredLocationCount = 0, which yields maxUnavailableEndpointRetryCount = Max(1, 0) = 1. This means only one retry is ever exercised before exhaustion.

In production, most accounts have multiple preferred regions (e.g., 3). The core value proposition of this PR — cycling through regions on metadata failure — is never tested with more than a single retry. If IncrementRetryIndexOnUnavailableEndpointForMetadataRead() had an off-by-one error at higher retry counts, or if OnBeforeSendRequest failed to properly resolve different endpoints on successive retries, no test would catch it.

Suggestion: Add a test with PreferredLocationCount = 3 that:

  1. Returns different endpoints from ResolveServiceEndpoint on each call (region1, region2, region3)
  2. Simulates 3 sequential 503 failures (calling OnBeforeSendRequest + ShouldRetryAsync in a loop)
  3. Asserts each returns ShouldRetry = true with MarkEndpointUnavailableForRead called for the correct endpoint
  4. Asserts the 4th attempt returns NoRetry

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

return Task.FromResult(this.HandleRegionalFailure());
}

if (exception is OperationCanceledException && !cancellationToken.IsCancellationRequested)
Member

🟢 Suggestion — Novel pattern: treating non-user OperationCanceledException as a regional failure

This is a new behavioral pattern not present in any sibling retry policy. ClientRetryPolicy.ShouldRetryAsync(Exception) does not treat OperationCanceledException as a regional failure requiring endpoint marking and cross-region retry — it only marks per-partition endpoints and falls through to the throttling policy (which returns NoRetry).

The rationale here (transport timeout = region issue) is reasonable for metadata requests, since a timeout on a metadata read is a strong signal of regional unavailability. But this creates a behavioral divergence between the metadata-level and operation-level policies interpreting the same exception type.

Questions to confirm:

  1. Is this intentional? If so, consider adding a comment explaining why metadata-level OCE handling differs from ClientRetryPolicy.
  2. Should ClientRetryPolicy be aligned in a follow-up PR to also treat non-user OCE as a regional failure for metadata requests?

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

}

[TestMethod]
public async Task InitializeAsync_WithCosmosException_ShouldThrowAfterMaxRetries()
Member

🟢 Suggestion — Missing test for HttpRequestException retry exhaustion in BootstrapperCore

The CosmosException path has an exhaustion test (InitializeAsync_WithCosmosException_ShouldThrowAfterMaxRetries) that verifies the exception propagates after MaxInitializationRetries + 1 attempts. The HttpRequestException path has no equivalent exhaustion test — only a retry-and-succeed test.

While the two catch blocks are structurally symmetric, they are separate code paths with independent when guards. A bug in the HttpRequestException catch's guard or retry logic would not be caught by the existing CosmosException exhaustion test.

Suggestion: Add InitializeAsync_WithHttpRequestException_ShouldThrowAfterMaxRetries mirroring the CosmosException exhaustion test.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

/// regions left to try; <see cref="ShouldRetryResult"/> with <c>ShouldRetry = false</c> otherwise,
/// allowing the exception to propagate to the operation-level retry policy.
/// </returns>
private ShouldRetryResult HandleRegionalFailure()
Member

💬 Observation — Clean consolidation with a subtle behavioral change

HandleRegionalFailure() cleanly consolidates the "mark + increment" pattern and replaces 4 separate code blocks. The structural improvement is welcome.

One subtle behavioral change worth noting: previously, when IncrementRetryIndexOnUnavailableEndpointForMetadataRead() returned false (retries exhausted), the code fell through to this.throttlingRetryPolicy.ShouldRetryAsync(exception, cancellationToken). Now it returns ShouldRetryResult.NoRetry() directly.

This is functionally equivalent today because ResourceThrottleRetryPolicy only retries 429s and returns NoRetry for the status codes handled here (503, 500, 410/LeaseNotFound, 403/DatabaseAccountNotFound). However, it changes the implicit contract: the throttling policy is no longer consulted after region exhaustion. If ResourceThrottleRetryPolicy ever gains broader retry logic in the future, this path would be missed.

Not blocking — just calling it out for awareness.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

}

[TestMethod]
public async Task InitializeAsync_WithCosmosException_ShouldThrowAfterMaxRetries()
Member

🟢 Suggestion — Add test for non-retryable CosmosException to document intent

The catch (CosmosException) in BootstrapperCore catches all status codes (see separate comment). Whether this is intentional or a bug, a test explicitly exercising a non-regional CosmosException (e.g., 404 NotFound) would document the expected behavior:

  • If retrying all exceptions is intended: the test asserts 4 attempts and adds a comment explaining why
  • If only regional errors should retry: the test asserts the exception propagates immediately, and the when guard gets a status code filter

Either way, a test closes the ambiguity.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

/// the endpoint as unavailable in the <see cref="IGlobalEndpointManager"/> when
/// a regional failure is detected.
/// </summary>
private Uri locationEndpoint;
Member

💬 Observation — Design: Marking endpoint unavailable affects ALL reads, not just metadata

MarkEndpointUnavailableForRead() signals the LocationCache to deprioritize this endpoint for all subsequent read operations (queries, point reads, change feed), not just metadata requests. The endpoint moves to the end of the preferred list and self-heals after the 5-minute TTL (DefaultUnavailableLocationsExpirationTimeInSeconds = 300).

This is the correct trade-off: if a region is failing for metadata reads (partition key range lookups), it's very likely failing for data reads too. The impact is soft (deprioritized, not blocked) and self-healing. This aligns with how ClientRetryPolicy.ShouldRetryOnEndpointFailureAsync already works.

Also confirmed: double-marking (from both this policy and ClientRetryPolicy when the exception propagates) is benign — LocationCache.MarkEndpointUnavailable uses ConcurrentDictionary.AddOrUpdate, which is idempotent (just refreshes the TTL timestamp).

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

MaxInitializationRetries,
this.sleepTime);

await Task.Delay(this.sleepTime).ConfigureAwait(false);
Member

🟢 Suggestion — Task.Delay without CancellationToken

The review rules require verifying cancellation token propagation on all async methods. Both Task.Delay(this.sleepTime) calls in the new catch blocks don't accept a cancellation token — if the caller wants to cancel during the 15-second retry delay, they can't.

The InitializeAsync() method itself doesn't take a CancellationToken parameter (pre-existing design), so there's no token to pass here. This is a pre-existing limitation of the Bootstrapper interface, not introduced by this PR.

Consider adding CancellationToken support to InitializeAsync() in a follow-up if the team wants cancellable bootstrap initialization.
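
If that follow-up happens, the change would be small. A hypothetical sketch (the token parameter does not exist on the current Bootstrapper interface):

// Hypothetical follow-up: thread a token through InitializeAsync so the retry delays are cancellable.
public async Task InitializeAsync(CancellationToken cancellationToken = default)
{
    // ... existing initialization and catch blocks ...
    await Task.Delay(this.sleepTime, cancellationToken).ConfigureAwait(false);
    // ...
}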

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

@xinlian12
Member

Review complete (50:07)

Posted 9 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage


Development

Successfully merging this pull request may close these issues.

Metadata request (pkranges) 503 failure does not trigger cross-region retry at the operation level
